Abstract

An oscillation with a period of around 500 kb in guanine and cytosine content (GC%) 
is observed in the DNA sequence of human chromosome 21. This oscillation is localized in 
the rightmost one-eighth region of the chromosome, from 43.5 Mb to 46.5 Mb. Five cycles 
of oscillation are observed in this region with six GC-rich peaks and five GC-poor valleys. 
The GC-poor valleys comprise regions with low density of CpG islands and, alternating 
between the two DNA strands, low gene density regions. Consequently, the long-range
oscillation of GC% results in spacing patterns of both CpG island density and, to a lesser
extent, gene density.

1 Introduction

Periodicities and characteristic length scales in biological sequences have been of long- 
standing interest in sequence analysis. The codon structure and the corresponding length 
scale of three bases in protein coding sequences (Shulman et al., 1981; Fickett, 1982; 
Staden and McLachlan, 1982) is one of the main features used for computational
gene recognition (see, e.g., Borodovsky et al., 1986; Borodovsky and McIninch, 1993;
Burge and Karlin, 1997; Tiwari et al., 1997; Guigo, 1999; Grosse et al., 2000; Li et al.,
2002). This periodicity arises from a tendency for certain base types to be at certain
codon positions. A correlation
at a distance of 10-11 bases was also detected (Trifonov and Sussman, 1980; Baldi et al.,
1996; Widom, 1996; Tomita et al., 1999). Two possible explanations have been put forward
to explain this (Herzel et al., 1998). The first explanation is that this periodicity is related 
to DNA bending and nucleosome formation (Trifonov and Sussman, 1980). The second 
explanation is that it is a reflection of a periodicity in protein sequences (Zhurkin, 1981), 
because most sequences that exhibit 10-11 base correlations are protein coding sequences 
(see, however, Holste et al. (2003), where a 10-11 base correlation is detected
in the mostly non-coding sequences of human chromosomes 20, 21, and 22). Proponents of the second
explanation also argue that nucleosome formation is only aided by a sequence property 
on 5-6 bases (Zhurkin, 1981). Periodicities in protein sequences are seen to be natural, 
because their presence aids the secondary structure (Shiba et al., 2002). Besides these two 
well known length scales in DNA sequences, 3 and 10-11 bases, other length scales, e.g., 
120, 200, and 450 bases, were also proposed, based mainly on a theoretical argument on 
the size of an exon, the size of nucleosome unit, and the size of a typical prokaryote protein 
(Trifonov, 1998). 

Tandem repeats of DNA segments introduce local or even global sequence periodicities 
depending on their distribution. If a sequence of k bases tandemly repeats many times, 
there would be base-base correlations at separations of k, 2k, 3k, ... bases, e.g., the periodicity
of k = 2 bases observed in non-coding sequences (Arques, 1987; Konopka et al., 1987).
If the repeat is not perfect, such as those in many subtelomeric sequences, a correlation
may not appear at the exact multiples of the basic unit size (Pizzi et al., 1990). Interspersed 
repeats in mammalian genomes (Smit, 1999) should in principle not introduce character- 
istic length scales in correlation patterns because their spatial distribution is not regular 
(except for length scales shorter than the size of one copy of the repeat). A recent survey
of characteristic length scales in many eukaryote genome sequences by spectral analysis
reveals peaks in power spectra at about 68, 59 and 94 bases in C. elegans (chromosomes 
I, II, and III, respectively), at about 248, 167, 126 bases in A. thaliana chromosome 3, at 
about 174, 88, 59 bases in chromosome 4, at about 356, 174, 88, 59 bases in chromosome 
5, and at about 167, 84 bases in H. sapiens chromosomes 21 and 22 (Fukushima et al., 
2002). A connection between these length scales and tandem or interspersed repeats for 
the three human chromosomes (20, 21, and 22) has been discussed in (Holste et al., 2003). 
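
The tandem-repeat argument in this paragraph can be illustrated with a minimal sketch. The repeat unit `ACGGTCA` and the helper `match_fraction` are hypothetical, chosen only for illustration: for a perfect tandem repeat of a unit of k bases, the fraction of identical base pairs at separation d reaches 1 exactly at d = k, 2k, 3k, ...

```python
def match_fraction(seq, d):
    """Fraction of positions i with seq[i] == seq[i + d]."""
    n = len(seq) - d
    return sum(seq[i] == seq[i + d] for i in range(n)) / n

unit = "ACGGTCA"      # hypothetical repeat unit, k = 7 (assumption)
seq = unit * 200      # perfect tandem repeat, 1400 bases
profile = {d: match_fraction(seq, d) for d in range(1, 22)}

# Separations at which every base matches its partner
peaks = [d for d, f in profile.items() if f == 1.0]
print(peaks)          # [7, 14, 21]
```

With an imperfect repeat (point mutations, insertions, deletions) the peaks would be lower and, as noted above, need not fall on exact multiples of the unit size.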

At distances much longer than hundreds of bases, it is more difficult to observe a 
correlation at the level of individual bases. It is, however, easier to observe correlations 
between base compositions, such as the guanine and cytosine content (GC%). The reason 
is as follows: instead of requiring a matching base by base at the exact spacing, correlation 
at the base composition level only requires GC% to be similarly high (or low) at certain 
range of spacings. In this paper, we report an unusual long-range oscillation of GC% with a 
periodicity of around 500 kb (1 kb = $10^3$ bases) in the DNA sequence of human chromosome
21. This periodicity is longer than any periodicity in DNA sequences detected so far. 



2 The DNA sequence of human chromosome 21 exhibits higher 
correlations 

Sequence data were downloaded from the UCSC human genome repository (available
at http://genome.ucsc.edu/), for the NCBI build 34 release. We evenly partition
each human chromosome into $N = 2^k$ non-overlapping windows (e.g., $k = 17$
and $N = 131{,}072$). GC% of each window is calculated, forming a GC% series $\{x_i\}$
($i = 1, 2, \ldots, N$). The correlation function $\Gamma(d)$ of this series is defined as the Pearson's
correlation coefficient of two truncated subseries: a right-hand side truncated $\{x'\} = \{x_i\}$
($i = 1, 2, \ldots, N - d$), and a left-hand side truncated $\{x''\} = \{x_i\}$ ($i = d + 1, d + 2, \ldots, N$):

$$\Gamma_w(d) = \frac{\mathrm{Cov}(x', x'')}{\sqrt{\mathrm{Var}(x')\,\mathrm{Var}(x'')}}, \qquad (1)$$

where the covariance is defined as $\mathrm{Cov}(x', x'') = \langle (x' - \langle x' \rangle)(x'' - \langle x'' \rangle) \rangle$ and the variance is
defined as $\mathrm{Var}(x) = \langle (x - \langle x \rangle)^2 \rangle$ ($\langle \cdot \rangle$ is the average of the $\{x\}$ series). Here, the subscript
$w$ is used to indicate the fact that the GC% series, and thus the detected patterns of correlation,
implicitly depend on the window size $w$. When the spacing $d \ll N$, the following
approximation formula can be used:

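$$\mathrm{Var}(x') \approx \mathrm{Var}(x'') \approx \mathrm{Var}(x), \qquad \Gamma_w(d) \approx \frac{\mathrm{Cov}(x', x'')}{\mathrm{Var}(x)}. \qquad (2)$$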
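
The definition in Eq. (1) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used for the analysis: the function names `gc_series`, `pearson`, and `correlation` are hypothetical, the sequence is assumed to contain only A, C, G, T, and the demonstration uses a synthetic periodic series rather than chromosome data.

```python
import math

def gc_series(seq, n_windows):
    """GC fraction in each of n_windows equal, non-overlapping windows."""
    w = len(seq) // n_windows          # window size in bases
    series = []
    for i in range(n_windows):
        window = seq[i * w:(i + 1) * w].upper()
        series.append(sum(base in "GC" for base in window) / w)
    return series

def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_b = sum((y - mb) ** 2 for y in b) / n
    return cov / math.sqrt(var_a * var_b)

def correlation(series, d):
    """Gamma(d) as in Eq. (1): correlate x[1..N-d] with x[d+1..N]."""
    return pearson(series[:len(series) - d], series[d:])

# Synthetic GC series with a period of 8 windows (illustration only)
x = [0.5 + 0.1 * math.sin(2 * math.pi * i / 8) for i in range(64)]
print(correlation(x, 8), correlation(x, 4))
```

For this synthetic series the correlation is close to +1 at a lag of one full period and close to -1 at half a period. A series with zero variance would cause a division by zero, which the sketch does not guard against.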

For each human chromosome, Fig. 1 shows $\Gamma(d)$ for both the GC% series derived from $2^{17}$
windows (bottom) and the GC% series derived from $2^{15}$ windows (top). Fig. 1 shows that
the magnitude of $\Gamma(d)$ depends on the window size.

We illustrate how the magnitude of the correlation function depends on the window
size by plotting in Fig. 2 the correlation $\Gamma_w(d)$ at a distance close to $d \approx 1$ Mb (1 Mb =
$10^6$ bases) for each human chromosome versus the chromosome-specific window size $w$.
All $w$ values are within the range of 0.3-2 kb. In a double-logarithmic representation,
$\log(\Gamma_w(d = 1\ \mathrm{Mb}))$ for the majority of human chromosomes follows a linear function of
$\log(w)$, and thus $\Gamma_w(d = 1\ \mathrm{Mb}) \sim w^{\beta}$ ($\beta > 0$) approximates a power-law function.

The dependence of $\Gamma_w(d)$ on the window size $w$ shown in Fig. 2 was previously observed
(Li and Holste, 2004). It can be explained as follows. As can be seen from Eq. (2),
$\Gamma_w(d)$ depends on both Cov() and Var(). While the denominator Var(GC%) depends
on $w$ and gradually decreases with increasing window size, the numerator Cov(GC%)
is practically independent of $w$. Experimentally, a slower-than-expected decrease of Var()
with the window size was already observed around 1976 (Macaya et al., 1976). For random
symbolic sequences, it can be shown from the binomial distribution that the variance of
GC% decreases with the window size $w$ according to Var(GC%) $\sim 1/w$ (Nekrutenko and
Li, 2000; Clay et al., 2001). Sequence analysis shows, however, that Var(GC%) $\sim 1/w^{\beta}$
with $0 < \beta < 1$ (Clay et al., 2001; Li and Holste, 2004). This "resistance to reduction of
variance" is directly related to the $1/f^{\alpha}$ ($\alpha \sim 1$) power spectrum (Clay et al., 2001; Clay
et al., 2003; Clay, 2003; Li, 2005) previously observed in DNA sequences (Li and Kaneko,
1992) by the relationship $\alpha \approx 1 - \beta$.
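
The binomial expectation for random sequences, Var(GC%) = p(1 - p)/w for a GC probability p, can be checked with a small simulation. This is a sketch under assumptions: the sequence is an independent random 0/1 series (1 = G or C) with p = 0.5, and `window_variance` is a hypothetical helper name.

```python
import random

def window_variance(bits, w):
    """Variance of the GC fraction over non-overlapping windows of size w."""
    n = len(bits) // w
    fracs = [sum(bits[i * w:(i + 1) * w]) / w for i in range(n)]
    mean = sum(fracs) / n
    return sum((f - mean) ** 2 for f in fracs) / n

random.seed(1)
p = 0.5                                   # assumed GC probability
bits = [1 if random.random() < p else 0 for _ in range(200_000)]

for w in (100, 400):
    observed = window_variance(bits, w)
    expected = p * (1 - p) / w            # binomial prediction
    print(w, observed, expected)
```

In such a random series, quadrupling the window size reduces the variance by roughly a factor of four. The slower decay reported for real DNA sequences is what distinguishes them from this null model.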

Combining the lack of influence on Cov() by the window size, and the factor of $1/w^{\beta}$
for Var(), $\Gamma(d)$ is expected to increase with increasing window size as $\Gamma_w(d) \sim w^{\beta}$. Indeed,
in Fig. 2 $\log(\Gamma_w(d))$ at $d = 1$ Mb increases with $\log(w)$ more or less linearly. Excluding the
outlying chromosomes (15, 21, 22, X, Y), the regression coefficient in Fig. 2 is $\beta \approx 0.52$. This
value of $\beta$ is consistent with decay exponents by the spectral analysis (Li and Holste, 2004),
but note that (i) it is an average among different chromosomes, whereas the parameter
fitting in (Li and Holste, 2004) is carried out separately on individual chromosomes; and
(ii) a particular distance $d \approx 1$ Mb is chosen.

Fig. 2 clearly shows that, considering the trend $\Gamma_w(d) \sim w^{\beta}$ (at $d = 1$ Mb), chromosomes
15, 22, and Y have lower correlations than an average human chromosome, while chromosome
21 has higher correlations than the remaining chromosomes. The lower correlation
in chromosome Y is caused by the large portion of unsequenced bases (about 50%) and
their substitution with random values (Li and Holste, 2004). In the next section, we will
examine closely the causes of the unusually high correlations in chromosome 21.
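
The exponent of a power-law trend such as the one in Fig. 2 can be estimated by ordinary least squares in double-logarithmic coordinates. The sketch below is illustrative only: `loglog_slope` is a hypothetical helper, and the input values are synthetic numbers generated from an exact power law, not the measured correlations.

```python
import math

def loglog_slope(ws, gammas):
    """Least-squares slope and intercept of log(gamma) versus log(w)."""
    xs = [math.log(w) for w in ws]
    ys = [math.log(g) for g in gammas]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic example: window sizes in bases, correlations following w**0.52
ws = [358, 500, 1000, 1870]
gammas = [0.02 * w ** 0.52 for w in ws]   # made-up prefactor, exact power law
beta, intercept = loglog_slope(ws, gammas)
print(round(beta, 3))
```

On real data the points scatter around the line, so outlying chromosomes would need to be excluded before the fit, as described above.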

3 An oscillation of 500 kb in the correlation function is located 
at the rightmost one-eighth of the chromosome 

Fig. 3 shows the correlation function $\Gamma(d)$ (for the GC% series calculated at the window size
of $w = 358$ bases) as a function of $d$, for the DNA sequence of human chromosome 21.
There is a striking oscillation in $\Gamma(d)$ which peaks at about 0.5, 1.1, 1.6, and 2.1 Mb with
an approximately constant spacing between peaks of about 500 kb. From Fig. 1, it can be
seen that this long-range oscillation is present only in human chromosome 21, and not in other
human chromosomes. As discussed above, a sequence periodicity can be caused either by
an exact repeat or by a tendency for a particular base to be located in a periodic location.
In either case, to maintain base-level periodicity for such a distance of hundreds of kb or
longer requires a selection pressure against insertion and deletion mutations.

To find out whether this 500 kb periodicity is localized in a particular region on chromosome
21, we segment chromosome 21 evenly into eight segments. Fig. 4 shows the
correlation function $\Gamma(d)$ for each segment as a function of $d$. The first and the second segments
are mostly unsequenced, and hence $\Gamma(d)$ is flat as unsequenced bases are substituted
by random bases. In the next three chromosomal regions (11.7 Mb-29.3 Mb), essentially
no apparent correlation structure is present in $\Gamma(d)$ in the range from $d = 500$ kb to 2.5 Mb. The
gradual decay of $\Gamma(d)$ from 500 kb to 1.5 Mb in the sixth chromosomal segment (29.3 Mb-
35.2 Mb) is mainly due to a transition from an L2-isochore to an H1-isochore (Bernardi, 2001;
Pavlicek et al., 2001). The correlation function of the last segment (41.1 Mb-46.9 Mb)
reveals the source of the 500 kb oscillation: the peak locations are exactly the same as
those in Fig. 3, albeit without the decay trend. This segment corresponds largely to the
GC-rich isochore of chromosome 21.
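
The segment-wise localization described above can be sketched as follows. This is a toy illustration under assumptions: `autocorr` and `segment_autocorr` are hypothetical helpers, and the series is synthetic, with uncorrelated noise in the first seven segments and a periodic component only in the last one.

```python
import math
import random

def autocorr(x, d):
    """Pearson correlation between x[0:N-d] and x[d:N]."""
    a, b = x[:len(x) - d], x[d:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def segment_autocorr(x, n_segments, d):
    """Correlation at lag d, computed separately in each equal-sized segment."""
    size = len(x) // n_segments
    return [autocorr(x[s * size:(s + 1) * size], d) for s in range(n_segments)]

rng = random.Random(0)
size, period = 400, 50                    # windows per segment, period in windows
series = [0.45 + 0.1 * (rng.random() - 0.5) for _ in range(7 * size)]
series += [0.55 + 0.1 * math.sin(2 * math.pi * i / period)
           + 0.04 * (rng.random() - 0.5) for i in range(size)]

per_segment = segment_autocorr(series, 8, period)
print([round(g, 2) for g in per_segment])
```

Only the last segment shows a strong correlation at the lag equal to the period, which mirrors how a localized oscillation can be separated from the rest of a chromosome.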

Fig. 5 shows the chromosomal region, from 43.5 Mb to 46.5 Mb, that exhibits the 500 kb
oscillation. Fig. 5(a) shows the GC% calculated from the window size of $w = 2{,}864$ bases.
Also shown in Fig. 5(a) is a sinusoidal function with the period of 500 kb. The second
half of the sinusoidal function is shifted from the first half to better fit the oscillation in
the GC% fluctuation. It can be seen that six GC-high peaks alternate with five GC-low
valleys, though the third valley is not as low as the others. The distance between two
GC-high peaks (or GC-low valleys) is approximately equal to 500 kb, with the exception
of the middle region that has a longer spacing between peaks. It is interesting to note that
the alternation of GC-high and GC-low regions, but not the regularity of the spacing, had
implicitly been detected before (e.g., Fig. 1 of Hattori et al., 2000, and Fig. 3 of Bernardi,
2001).
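
One simple way to place a fixed-period sinusoid on a GC% profile is to project the mean-subtracted data onto sine and cosine terms. The sketch below is not the procedure used for Fig. 5(a), which is not specified here. It assumes uniform sampling over a whole number of periods, and `fit_sinusoid` and the synthetic profile are hypothetical.

```python
import math

def fit_sinusoid(positions_kb, gc, period_kb=500.0):
    """Mean, amplitude, and phase of mean + A*sin(omega*t + phase) for a fixed period.

    Assumes uniform sampling over a whole number of periods, so that the
    sine and cosine terms are orthogonal and simple projections suffice.
    """
    n = len(gc)
    mean = sum(gc) / n
    omega = 2 * math.pi / period_kb
    a = 2 / n * sum((y - mean) * math.sin(omega * t) for t, y in zip(positions_kb, gc))
    b = 2 / n * sum((y - mean) * math.cos(omega * t) for t, y in zip(positions_kb, gc))
    return mean, math.hypot(a, b), math.atan2(b, a)

# Synthetic profile: three periods of 500 kb sampled every 10 kb
positions = [10.0 * i for i in range(150)]
profile = [0.5 + 0.05 * math.sin(2 * math.pi * t / 500.0 + 0.7) for t in positions]
mean, amplitude, phase = fit_sinusoid(positions, profile)
print(round(mean, 3), round(amplitude, 3), round(phase, 3))
```

Fitting the two halves of a region separately with this kind of projection would yield two phases, which is one way to express a phase shift between halves such as the one described for Fig. 5(a).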

Because interspersed repeats tend to have higher GC% than the rest of the sequence, 
we address the question of whether a regular spatial distribution of the repeat sequences is 
responsible for the observed 500 kb oscillation. Fig. 5(b) shows a similar GC% fluctuation
for the sequence with substituted interspersed repeats. The 500 kb oscillation persists even 
when interspersed repeats are substituted. 
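
A repeat substitution of this kind can be sketched as follows, assuming a soft-masked sequence in which repeats are written in lowercase (the convention of the UCSC soft-masked FASTA files). Whether the analysis here used exactly this representation is an assumption, and `substitute_repeats` is a hypothetical helper.

```python
import random

def substitute_repeats(seq, rng):
    """Replace lowercase (soft-masked repeat) bases with random bases.

    Uppercase bases are kept unchanged; each lowercase base is replaced
    by a base drawn uniformly from A, C, G, T.
    """
    return "".join(rng.choice("ACGT") if base.islower() else base for base in seq)

rng = random.Random(0)
masked = "ACGTacgtacgtGGCC"               # toy sequence, lowercase = repeat
print(substitute_repeats(masked, rng))
```

Drawing replacement bases uniformly pushes the GC% of the replaced stretches towards 50%. A draw that matches the background composition would be an alternative design choice.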

4 Discussion 

In this paper, we have observed a localized 500 kb long oscillation in GC% of human 
chromosome 21. We checked whether the region of chromosome 21 with this oscillation has
been the focus of investigations in previous large-scale correlation analyses of the human
genome, and we found that a segmental duplication of size 200 kb has been identified
on chromosome 21 (Golfier et al., 2003). However, the region reported in (Golfier et al.,
2003) is in the chromosome band 21q22.1, whereas the last one-eighth segment of
human chromosome 21 reported here is in the band 21q22.3.

The 21q22.3 band is both GC-rich and gene-rich, in marked contrast to the 7 Mb GC-poor
isochore localized in 21q21.1-21q21.2 (Hattori et al., 2000). There are 68 known genes
within the position of 43.5-46.5 Mb, or roughly one gene per 44 kb. As a comparison,
there are a total of 268 genes for the whole chromosome 21 with a length of 46.9 Mb, or one
gene per 175 kb. Some of the genes are of interest to human disease gene mapping. For
example, a rare autoimmune disease that affects the endocrine glands, called autoimmune
polyglandular syndrome type I (APS1) or autoimmune polyendocrinopathy-candidiasis-
ectodermal dystrophy (APECED) (Online Mendelian Inheritance in Man (McKusick, 1998)
number 240300), was shown to be linked to markers in 21q22.3 (Aaltonen et al., 1994).
The linked region was further narrowed down to the autoimmune regulator (AIRE) gene
(Aaltonen et al., 1997; Nagamine et al., 1997), located at the position 44.561-44.574 Mb.
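
The gene densities quoted in this paragraph follow from simple arithmetic, reproduced here as a short check using only the numbers given above (variable names are illustrative).

```python
# Region with the oscillation: 43.5-46.5 Mb, 68 known genes
region_kb = (46.5 - 43.5) * 1000
region_genes = 68

# Whole chromosome 21: 46.9 Mb, 268 genes
chrom_kb = 46.9 * 1000
chrom_genes = 268

kb_per_gene_region = region_kb / region_genes     # about 44 kb per gene
kb_per_gene_chrom = chrom_kb / chrom_genes        # about 175 kb per gene
fold = kb_per_gene_chrom / kb_per_gene_region     # relative gene density

print(round(kb_per_gene_region), round(kb_per_gene_chrom), round(fold, 1))
```

The region is thus roughly four times as gene-dense as the chromosome average.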

Fig. 5(c) shows the location of mapped known genes, using the knownGene field of the
UCSC genome bioinformatics site (http://genome.ucsc.edu/goldenPath/gbdDescriptions.html).
Genes on the two DNA strands are plotted separately. When known genes on both strands
are considered together, there are no visible gaps in their spatial distribution. However,
the strand-specific gene distribution seems to match the amplitudes of the GC% oscillations. We
observe that the forward and the reverse strands alternately exhibit comparatively lower
local gene densities in GC-poor valleys: 1 and 3 for the reverse strand (-), 2 and 4 for
the forward strand (+). Both strands show a lower local gene density in GC-poor valley
region 5.

We next investigated the spatial distribution of quantities that are directly related to 
GC%. Fig. 5(d) shows "long homogeneous genome regions" (Oliver et al., 2001) as detected 
by the program IsoFinder (Oliver et al., 2004). There are two interesting immediate 
observations: Firstly, the third valley is not as GC-poor as predicted by the sinusoidal
function, which can also be confirmed by examining Fig. 5(a) and (b). Secondly, there is a
lack of periodic alternation between GC-rich and GC-poor segments of comparable
sizes. This can be explained by the nature of the segment, or isochore, view of global GC%
variation. Isochores correspond to GC% fluctuations that can be approximated by step
functions. Gradual changes of GC% as captured by sinusoidal functions are approximated
poorly by step functions.

Fig. 5(d) also shows the location of CpG islands (map extracted from the UCSC genome
bioinformatics site). A visual inspection shows that CpG islands are rare in GC-poor
valleys, in particular valleys No. 1, 3, 4, and 5. This is not a completely unexpected observation
since one of the criteria for CpG island detection is GC%. In particular, one of
the commonly used methods for CpG island detection requires GC% to be higher than 50%
(Larsen et al., 1992). The GC-high peaks and GC-poor valleys in Fig. 5 are separated by
the GC% = 50% line, and this may explain why CpG islands are less likely to be found in
these GC-poor valleys.
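
To make the GC% criterion concrete, the sketch below computes the GC fraction and the observed/expected CpG ratio of a sequence. Only the 50% GC threshold comes from the text above. The ratio threshold of 0.6 is an assumption taken from commonly cited CpG island criteria, and the minimum-length requirement that usually accompanies it is omitted. The helper names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                          # CpG dinucleotides
    gc_fraction = (c + g) / n
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_gc=0.5, min_obs_exp=0.6):
    """Check the GC% threshold (from the text) and an assumed CpG ratio threshold."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp

print(cpg_stats("CGCGCGCGAT"))            # (0.8, 2.5)
print(looks_like_cpg_island("ATATATATGC"))  # False
```

A window whose GC% stays below 50%, as in the GC-poor valleys, fails the first condition regardless of its CpG content.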

In summary, we have detected a unique long-range oscillation in a localized region in
human chromosome 21, which is absent in the remaining human chromosomes. It will be
of interest to determine the key biological features either causing or resulting from this
ultra-long-range periodicity in human chromosome 21. While it cannot be excluded that
this particular oscillation is due to chance events, one of the promising directions to pursue
is its connection to chromatin structure and DNA loops, along the lines of research
on the connection between these structural units and GC% (Saccone et al., 2002; Bernardi,
2004).

Figure 1: Correlation function $\Gamma(d)$ of the GC% series with $2^k$ GC values ($k = 15$: dots, top; and $k = 17$:
lines, bottom), obtained from all 24 human chromosomes (22 autosomal and 2 sex chromosomes). The
x-axis, in a logarithmic scale, is the distance $d$ converted to units of Mb ($10^6$ bases).

Figure 2: Correlation $\Gamma_w(d)$, at the distance $d \approx 1$ Mb, as a function of the chromosome-specific window
size $w$ (in a double logarithmic plot). Each point represents one human chromosome. The window size $w$
is the chromosome length divided by $2^{17}$.



[Figure 3 plot. Panel title: auto-correlation function for chr21. X-axis: distance (Mb).]



Figure 3: Correlation function $\Gamma(d)$ of the GC% series for the DNA sequence of human chromosome 21.
GC% is calculated at the window size of 358 bases, which is $1/2^{17}$ of the total chromosome length. The
distances of 0.5, 1, 1.5, and 2 Mb are marked by long vertical lines, and the spacing of 100 kb is marked
by short vertical lines.

Figure 4: Correlation function $\Gamma(d)$ of the GC% series for eight chromosomal regions of the DNA sequence
of human chromosome 21. The distances of 0.5, 1, 1.5, and 2 Mb are marked by vertical lines.

Figure 5: GC% fluctuations of the last one-eighth segment of the DNA sequence of human chromosome
21 towards the q-terminal end. (a) GC% calculated for the window size of $w = 2.864$ kb, which is $1/2^{14}$ of the
whole chromosome length. A sinusoidal function with the period of 500 kb is superimposed on the plot
to fit the periodic oscillation of GC%. (b) GC% calculated with interspersed repeats replaced by random
values, then smoothed by means of running medians, using the S-PLUS subroutine smooth. (c) Locations
of known genes as determined by protein sequences from SWISS-PROT, TrEMBL, and TrEMBL-NEW,
and their corresponding mRNAs from GenBank, displayed separately for each DNA strand. (d) Locations
of CpG islands (top) and isochores (bottom). For the isochore map, the GC% of individual isochores is
indicated by the height of the horizontal bar.



